Test-wiseness was introduced as a construct at least four decades ago. Thorndike 
(1951), discussing sources of variation entering into observed test score differences, identified 
test-wiseness as a persistent, general attribute of the examinee that would contribute in part to 
differences among individuals. In their seminal article, Millman, Bishop and Ebel (1965) 
identified the concept of test-wiseness, articulating their explanation with a proposed taxonomy 
of test-wiseness skills. Millman et al. defined test-wiseness as "a subject's capacity to utilize 
the characteristics and formats of the test and/or the test taking situation to receive a high 
score" (p. 707). They further asserted that test-wiseness "is logically independent of the 
examinee's knowledge of the subject matter for which the items are supposedly measures" (p. 
707). A refinement offered by Millman et al. was to separate test-wiseness skills into 
two broad domains: skills that are logically independent of the test purpose or test constructor 
(class I), and skills that are dependent on the test purpose or test constructor (class II). 

Research on test-wiseness suggests that: (a) differences in test scores do correlate, to 
varying degrees, with test-wiseness (Sarnacki, 1979); (b) test-wiseness skills can be taught to and 
learned by examinees as young as the upper elementary school grades (Sarnacki, 1979; Samson, 
1985; Dolly & Williams, 1986); and (c) teacher-made tests often include cues that 
make the items artificially easier for test-wise examinees (Brozo, Schmelzer & Spires, 
1984). However, very little research has examined whether different test-wiseness skills are 
equally difficult to learn or to apply in a testing situation. To date, the related literature has 
been scarce. 

Differences Among Test-Wiseness Skills for Adults 

The studies discussed in this section stemmed from investigations of whether poor 
item-writing practices, as described in textbooks on measurement or test construction, actually 
influenced examinee performance on tests or the technical characteristics of tests. The basic 
research design was to take what were considered acceptable test items, rewrite 
them to reflect various item-writing flaws pointed out in texts, and administer the 
items in counterbalanced fashion to examinees. The degree to which average performance 
differed on the items written to include flaws versus the original versions was thought to 
indicate the impact of the poor item-writing practices. Generally, only a few test-wiseness 
skills were incorporated in these studies. 

Dunn and Goldstein (1959) tested 832 Army trainees during the eighth week of basic 
training on four-option multiple choice items covering four subject areas. Twenty-five items 
were written to reflect each of three test-wiseness cues: inclusion of irrelevant cues or 
specific determiners (the Millman et al. category for specific determiners is II.B.3), grammar 
cues (II.B.1.h), or having the longest alternative be the correct choice (II.B.1.a). The mean 
item p-value (proportion of examinees correctly responding) for each type of item was, in 
order, .54 for cues/specific determiners, .55 for grammar cues, .59 for length, and .61 for 
items having both length and grammar cues. When compared with "unflawed" items, the 
mean differences in p-values were .03, .03, .07, and .09 for cues, grammar, length, and 
length plus grammar, respectively. Thus, in this study, length (in this case the longest answer 
being correct) would appear to be the type of cue most easily detected by examinees 
uninstructed in test-taking skills. 

Board and Whitney (1972) rewrote acceptable items from an undergraduate course in 
American Politics to reflect various item-writing flaws; those that related to test-wiseness skills 
included: (a) keyed responses noticeably longer or shorter (II.B.1.a), and (b) grammatical cue 
to keyed response (II.B.1.h). Fifteen items for each skill (or flaw) were used, and were tested 
in both flawed and unflawed form. Each set of items was administered to 80 undergraduates 
(160 total) who had been blocked into five levels based on their performance on an 
unadulterated test of course content. Overall, the mean p-value for the length cue items was 
.64, while that for grammar cue items was .67. The difference between the flawed and 
unflawed versions of items was negligible, however: about .01 for length cue and .00 for 
grammar cue. Board and Whitney noted a statistically significant item version by ability level 
interaction for the length cue items. By quintile, the mean p-value difference for the flawed 
items was .08, .04, .10, -.15, and .00 for the fifth to the first ability level, respectively. Thus 
the lower ability students were better able to capitalize on this type of cue, but their 
performance was lower than that of the upper ability students. 

Weiten (1984) generated eight "flawed" or test-wiseness cue-laden and eight "good" 
items for each of four test-wiseness cues: (a) item stem-answer resemblance (Millman et al. 
category II.B.4), (b) grammar cue (II.B.1.h), (c) implausible or absurd alternatives (I.D.1), 
and (d) length of (longer) correct alternative (II.B.1.a). When these items were administered to 
54 undergraduate students enrolled in a child psychology course, the mean p-values were 
.59 for stem-answer similarity, .60 for grammar and for length, and .71 for absurd 
alternatives. The mean differences in p-value across flawed and good versions of the items 
were .03 for grammar, .09 for length, .10 for stem-answer resemblance, and .18 for absurd 
alternatives. 

Each of the studies involving adult respondents addressed from two to four test-wiseness 
skills; in all, five such skills were investigated. The findings of these studies are 
summarized in Table 1. 

Differences Among Test-Wiseness Skills for Children 

With one exception, the studies cited in this section were not specifically oriented 
toward evaluating the relative difficulty of test-wiseness skills. As was the case for studies 
involving adults, most studies involving children only addressed a small number of test- 
wiseness skills. 

Slakter, Koehler, and Hampton (1970a, 1970b) administered a test measuring four of 
the Millman et al. skills, each by four multiple choice items: (a) stem-answer resemblance, (b) 
options known to be incorrect (I.D.1), (c) similar options (I.D.2), and (d) specific determiners 
in options (II.B.3). In the first study (Slakter et al., 1970a), these items were administered to 
approximately 2360 students in grades 5-11 in small school districts in western New York and 
northern Michigan. Though means were not given, the relative order of mean p-values was 
stable across the skills for five of the seven grades. From easiest to most difficult, the 
sequence was known incorrect options, stem-answer resemblance, similar options, and specific 
determiners. 

In the second study (Slakter et al., 1970b), the authors administered the same test to 76 
high school seniors who had been trained in test-wiseness skills and to 85 seniors who had not 
been trained. On the immediate posttest after training, the mean item p-values for the trained 
students were: .75 for specific determiners, .80 for similar options, .82 for stem-answer 
resemblance, and .86 for known incorrect options. These values were in the same order as 
found for the students in grades 5-11. However, the difference in mean p-values between the 
trained and untrained groups was just the reverse; the largest difference (.33) was observed for 
specific determiners, followed by .28 for similar options, .15 for stem-answer resemblance, 
and .02 for known incorrect options. Thus, what apparently were the more difficult skills to 
demonstrate were those on which the greatest gains due to training were observed. 
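
The reversal described for Slakter et al. (1970b) can be checked from the four pairs of values quoted in the text. The following Python sketch is illustrative only: the variable names are assumptions introduced here, and the numbers are the means as cited above, not a reanalysis of the original data.

```python
# Mean item p-values for the trained seniors and the trained-minus-untrained
# differences, as cited in the text for Slakter et al. (1970b).
trained_p = {
    "specific determiners": 0.75,
    "similar options": 0.80,
    "stem-answer resemblance": 0.82,
    "known incorrect options": 0.86,
}
training_gain = {
    "specific determiners": 0.33,
    "similar options": 0.28,
    "stem-answer resemblance": 0.15,
    "known incorrect options": 0.02,
}

# Hardest skill first: a lower p-value means fewer correct responses.
by_difficulty = sorted(trained_p, key=trained_p.get)
# Largest training gain first.
by_gain = sorted(training_gain, key=training_gain.get, reverse=True)

# True when the hardest skills are also those with the largest gains.
same_order = by_difficulty == by_gain
print(by_difficulty)
print(same_order)
```

With only four skills, agreement of the two orderings is descriptive rather than a statistical test.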

Diamond and Evans (1972) generated six four-option multiple choice items to measure 
each of five test-wiseness skills. These were administered to 95 sixth grade students from a 
suburban Philadelphia school. In order of increasing mean p-value, the skills were: (a) 
grammar cue (II.B.1.h) = .35, (b) overlapping distractors, such that the truth of one implies 
the correctness of several others (I.D.2) = .45, (c) specific determiners (II.B.3) = .50, (d) 
length of correct alternative (II.B.1.a) = .53, and (e) alliterative association (II.B.4) = .77. 
An interesting twist to this study was that these subscores were correlated with scores on the 
Lorge-Thorndike IQ test; moderate correlations were observed for specific determiners (.43), 
grammar (.46) and alliterative association (.51), but not for length of correct alternative (.21) or 
overlapping distractors (.05). 

Diamond, Ayrer, Fishman, and Green (1976) administered the same set of items used 
by Diamond and Evans (1972) to 40 fifth grade and 36 sixth grade students at an inner city 
school in Philadelphia. In order of difficulty, the overall mean p-values were: (a) grammar 
cue = .23, (b) specific determiners = .31, (c) overlapping distractors = .34, (d) length of 
correct alternative = .39, and (e) alliterative association = .45. The ranking of these skills 
was nearly the same as that observed with the higher ability students in Diamond and Evans; 
only specific determiners and overlapping distractors exchanged positions. In general, 
the sixth grade students in Diamond et al. outperformed the fifth grade students; the two 
exceptions, on which roughly equal mean performance was observed, were grammar 
cues and overlapping distractors. 
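
How closely the two Philadelphia samples agree in their ordering of the skills can be summarized with a Spearman rank correlation computed from the means quoted above. The sketch below is an illustration added here, not an analysis reported by either study; the function names are assumptions, and the formula used presumes there are no tied values.

```python
# Mean item p-values by skill, as cited in the text.
diamond_evans_1972 = {
    "grammar cue": 0.35,
    "overlapping distractors": 0.45,
    "specific determiners": 0.50,
    "length of correct alternative": 0.53,
    "alliterative association": 0.77,
}
diamond_et_al_1976 = {
    "grammar cue": 0.23,
    "overlapping distractors": 0.34,
    "specific determiners": 0.31,
    "length of correct alternative": 0.39,
    "alliterative association": 0.45,
}


def ranks(p_values):
    """Rank skills from hardest (1) to easiest; assumes no tied p-values."""
    ordered = sorted(p_values, key=p_values.get)
    return {skill: position + 1 for position, skill in enumerate(ordered)}


def spearman_rho(first, second):
    """Spearman rank correlation for two untied sets of values."""
    rank_a, rank_b = ranks(first), ranks(second)
    n = len(rank_a)
    sum_d2 = sum((rank_a[s] - rank_b[s]) ** 2 for s in rank_a)
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


rho = spearman_rho(diamond_evans_1972, diamond_et_al_1976)
print(round(rho, 2))
```

With five skills, a single exchange of adjacent ranks is enough to move the coefficient below 1, so the value is best read as a description of the two orderings.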

McMorris, Brown, Snyder, and Pruzek (1972) constructed seven "flawed" and seven 
"clean" items for each of three test-wiseness skills to be administered to 494 eleventh grade 
students in a suburban New York school district. In order of mean difference in p-value 
favoring flawed over clean items, the skills included: (a) length of correct alternative = .03, 
(b) grammar cues = .07, and (c) stem-correct answer resemblance = .09. 

Carter (1986), in an unusual study, administered one item for each of five test-wiseness 
skills: (a) having choice "C" be the correct answer (II.B.1.d), (b) length of correct alternative, 
(c) alliterative association, (d) grammar cue, and (e) "+/- options" in which one option was 
positively stated and the other three were negatively stated (II.B.1.f). The items were 
administered to 312 seventh grade students. In increasing order of p-value, the skills were 
grammar cue = .27, longer correct alternative = .50, alliterative association = .55, choice 
"C" = .69, and +/- options = .80. Subsequent interviews with some of the participants 
suggested, however, that the +/- options item suffered also from several absurd alternatives in 
the negatively stated choices. 

The only study that explicitly appraised relative difficulty of certain of the Millman et 
al. test-wiseness skills was reported by Morse (1980). Twelve skills were appraised, each with 
from four to six items per skill, by administration to about 2900 fifth and sixth grade students 
in 30 Mississippi school districts. Using Rasch model logit scaling, the mean difficulty of the 
various skills was, in increasing order: (a) guess when there is no penalty (I.C.1, -2.15), (b) 
look over test before starting (I.A.2/I.A.3, -1.20), (c) read and follow directions (I.B.1, -.80), 
(d) look for cues elsewhere in the test (I.D.5, -.23), (e) change your answer if you believe 
your first choice is wrong (I.A.5, -.02), (f) specific determiners (II.B.3, .04), (g) stem-item 
resemblance (II.B.4, .10), (h) grammar cue (II.B.1.h, .14), (i) length of correct alternative 
(II.B.1.a, .22), (j) budget your time and check your progress (I.A.2, .74), and (k) don't 
choose your answer from a set of similar answers (I.D.2/I.D.4, .97). Morse reported that 
there was a statistically significant difference between the class I and class II skills in mean 
difficulty level, with the difference being nearly one-half a logit (standard deviation), such that 
the class I skills were the easier to demonstrate. 
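
The size of the class I versus class II gap can be roughly reproduced from the skill-level values. The sketch below uses the twelve Morse (1980) values as listed in Table 2 and takes a simple unweighted mean over skills within each class. That is an assumption made here for illustration: Morse's own comparison may have been carried out at the item level, so the result should be read only as a consistency check.

```python
from statistics import mean

# Mean Rasch difficulty (logits) by skill for Morse (1980), as listed in Table 2.
# Class I skills carry Millman et al. codes beginning with "I."; class II with "II.".
class_1 = {
    "guess (I.C.1)": -2.15,
    "look over test (I.A.2/I.A.3)": -1.20,
    "read and follow directions (I.B.1)": -0.80,
    "cues elsewhere in test (I.D.5)": -0.23,
    "change answer if wrong (I.A.5)": -0.02,
    "known wrong answers (I.D.1)": -0.02,
    "budget time and check progress (I.A.2)": 0.74,
    "similar alternatives (I.D.2/I.D.4)": 0.97,
}
class_2 = {
    "specific determiners (II.B.3)": 0.04,
    "stem-answer resemblance (II.B.4)": 0.10,
    "grammar cue (II.B.1.h)": 0.14,
    "length of correct alternative (II.B.1.a)": 0.22,
}

# Unweighted skill-level means (an assumption; see the note above).
class_1_mean = mean(list(class_1.values()))
class_2_mean = mean(list(class_2.values()))
difference = class_2_mean - class_1_mean

print(round(class_1_mean, 2), round(class_2_mean, 2), round(difference, 2))
```

The unweighted difference comes out a little under half a logit, in line with the description above.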

Across the studies reported, most may be characterized as not directly addressing the 
issue of relative difficulty of test-wiseness skills, and most involved only a few such skills. 
The purpose of the present study was to investigate the relative difficulty of the seven 
test-wiseness skills measured by the Gibb Experimental Test of Testwiseness. 

Method 

Subjects 

Participants were 243 undergraduate students (62 men, 178 women, 3 unidentified by 
gender) from three universities. Forty-one (17%) were African-American students, four (2%) 
were Asian-American, 191 (79%) were Caucasian, three (1%) were Hispanic, and the other 
four were unidentified by ethnicity. The mean age of participants was 22.5 years (SD = 5.2). 
The self-reported mean grade point average was 3.0 on a four-point scale (SD = 0.5). All 
participants volunteered to enter the study; as an incentive to participate, test-taking skill 
workshops were offered after completion of the study. 
Instrument 

The Gibb Experimental Test of Testwiseness (Gibb, 1964) was designed to measure 
seven cue-using skills, each with 10 four-option multiple choice items: (a) alliterative 
association (II.B.4), (b) incorrect/absurd alternatives (I.D.1), (c) specific determiners (II.B.3), 
(d) precision or qualification of answer (II.B.1.b), (e) longer correct alternative (II.B.1.a), 
(f) grammar cue (II.B.1.h), and (g) cues elsewhere in the test (I.D.5). Gibb found that the test 
could distinguish the test-wiseness performance of trained from untrained undergraduate 






Relative Difficulty of Selected TW Skills 






students. Sarnacki (1979), in his review on test-wiseness, declared the Gibb test to be the best 
available measure of test-wiseness. Miller, Fuqua and Fagley (1990) performed a principal 
components analysis on the seven subskills of the Gibb test and concluded that a two-factor 
structure seemed to represent the test well. Harmon, Morse and Morse (1994) reported on a 
confirmatory factor analysis of the Gibb test, showing that either a two-factor or a one-factor 
model could be asserted to represent the test. 
Procedure 

All participants were administered the Gibb Experimental Test of Testwiseness. There 
were no special instructions regarding guessing, nor did subjects receive any training in test- 
wiseness principles prior to completing the test. Gibb (1964) reported that undergraduate 
students could easily complete the 70-item test within 45 minutes, and that time seemed ample 
for all but a very few of the participants. Separate, machine-scoreable answer sheets were 
used to record the responses, which were then scanned for further analysis. 

One-parameter logistic model (Rasch) item difficulty estimates were generated for each 
of the 70 items. These difficulty estimates were scaled as logits (log units), and arbitrarily 
centered at zero. On the logit scale, one logit is analogous to a standard deviation. Difficulty 
values below zero represent relatively easier items whereas values above zero represent 
relatively more difficult items. The Rasch difficulty values were then used as the data for a 
one-way analysis of variance (ANOVA), treating the seven test-wiseness skills as seven levels 
of the factor. Thus, the sample size for each skill was 10, representing the obtained Rasch 
difficulty estimate for each item measuring a given skill on the Gibb test. Significance tests 
were run at the .05 level. 
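
To make the logit scale concrete, the sketch below converts item proportions correct into centered log-odds difficulties. This is only a rough first-order stand-in for Rasch estimation, offered as an assumption for illustration: the text does not state which software or estimation routine produced the actual estimates, and the function name and example proportions are hypothetical.

```python
import math


def approximate_logit_difficulties(p_values):
    """Centered log-odds difficulties from item proportions correct.

    Assumption: a crude approximation to Rasch item difficulty, not the
    estimation procedure used in the study. Harder items (lower p) receive
    higher values, and the set is centered so that the mean is zero.
    """
    raw = [math.log((1 - p) / p) for p in p_values]
    center = sum(raw) / len(raw)
    return [value - center for value in raw]


# Hypothetical proportions correct for five items (not data from the study).
example_p = [0.80, 0.65, 0.50, 0.35, 0.20]
difficulties = approximate_logit_difficulties(example_p)

for p, d in zip(example_p, difficulties):
    print(f"p = {p:.2f}  difficulty = {d:+.2f} logits")
```

Values produced this way follow the sign convention described above: easier items fall below zero and harder items above it.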

Results 

The mean Rasch difficulty estimates by test-wiseness skill measured on the Gibb test 
are presented in Table 3, and varied from -.33 (grammar cue) to .58 (specific determiners). 
Initial checks for homogeneity of variance (Levene's F(6,63) = 1.09, p = .379) and normality 
(Lilliefors' adaptation of the Kolmogorov-Smirnov test, D-max = .086, p > .20) suggested 
no apparent problems with the usual ANOVA assumptions. The one-way ANOVA yielded a 
statistically significant result, F(6,63) = 4.47, p = .0008. Follow-up testing via Tukey's 
HSD procedure indicated that the most difficult skill on average, specific determiners (M = 
.58), was statistically significantly more difficult than grammar cues (M = -.33), longer correct 
alternatives (M = -.28), and absurd or unrelated alternatives (M = -.25). No other difference 
was statistically significant. 
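
Because each skill is represented by the same number of items, the omnibus F ratio can be approximately rebuilt from the skill means and standard deviations in Table 3. The sketch below is a consistency check added here, not the authors' computation. It assumes the tabled SDs are sample standard deviations (n - 1 in the denominator), and because the tabled values are rounded to two decimals the result only approximates the reported F(6,63) = 4.47.

```python
# (mean, SD) of the Rasch difficulty estimates by skill, from Table 3.
summary = {
    "grammar cue": (-0.33, 0.39),
    "longer correct alternative": (-0.28, 0.43),
    "absurd alternatives": (-0.27, 0.70),
    "cues elsewhere on test": (-0.01, 0.61),
    "more precise/qualified alternative": (0.03, 0.32),
    "alliterative association": (0.25, 0.36),
    "specific determiners": (0.58, 0.52),
}
n_per_skill = 10  # items per skill on the Gibb test

k = len(summary)
means = [m for m, _ in summary.values()]
# With equal group sizes the grand mean is the mean of the group means.
grand_mean = sum(means) / k

# Between-skill and within-skill sums of squares from summary statistics.
ss_between = n_per_skill * sum((m - grand_mean) ** 2 for m in means)
ss_within = (n_per_skill - 1) * sum(sd ** 2 for _, sd in summary.values())

df_between = k - 1
df_within = k * (n_per_skill - 1)

f_ratio = (ss_between / df_between) / (ss_within / df_within)
eta_squared = ss_between / (ss_between + ss_within)

print(df_between, df_within, round(f_ratio, 2), round(eta_squared, 2))
```

The small gap between the rebuilt F ratio and the reported one is what rounding in the tabled means and SDs would be expected to produce.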

Discussion 

Some of the test-wiseness skills measured by the Gibb Experimental Test of 
Testwiseness do differ significantly in how easy they are to apply. Overall, the easier skills 
were observed to be the use of grammar cues, choosing the correct alternative when it was 
notably longer than other choices, and eliminating absurd or unrelated alternatives, which were 
found to be statistically significantly easier than avoiding alternatives containing specific 
determiners, such as all, everyone, or never. These findings are consistent with those that 
have been reported in other studies using young adults. 

Dunn and Goldstein (1959) found that specific determiner cues were the most difficult 
to demonstrate of those that they compared, while grammar and length of correct response 
cues had mean p-values that differed only slightly. Board and Whitney (1972) observed very 
little difference between mean item p-values for grammar and length cues. Weiten (1984) 
reported that grammar and length cues had the same mean item p-value, and were somewhat 
more challenging than were items involving absurd alternatives. Weiten's study indicated that 
stem-answer resemblance, which alliterative association represents, was the most difficult of 
the four skills he compared, but was not very different from grammar or length mean item 
p-values. In the present study, alliterative association was the second most difficult of the skills 
to demonstrate on the Gibb test. 

Results from studies using children are somewhat different; one reason for this 
difference is that grammar cues appear to be relatively more difficult for children to use than 
for adults. Diamond and Evans (1972), Diamond et al. (1976), Carter (1986), and Morse (1980) 
reported that grammar cue items were either the most or among the more difficult 
test-wiseness skills to demonstrate. Several studies, though, including Slakter et al. (1970a, 
1970b) and Diamond and Evans showed specific determiners to be the most difficult of the 
skills to apply. Morse noted that specific determiner items were close in mean difficulty to the 
average for the entire set of items, though as a set they were the fourth most difficult among 
the 12 skills examined. 

These results suggest that not all test-wiseness skills are of equal difficulty. Further, 
the way young adults are able to respond to measures of test-wiseness may be qualitatively 
different, due perhaps to experience or cognitive strategies that have evolved over time, than 
the way children can respond. Researchers or trainers addressing test-wiseness should take 
into account such differences. Further research addressing the age factor as well as a wider 
sampling of test-wiseness skills from the Millman et al. (1965) taxonomy would aid in 
understanding how these findings might apply over a broader range of examinee characteristics 
and specific test-wiseness skills. 
Table 1 

Findings on Difficulty of Test-Wiseness Skills from Studies Using Adults 

                                      Millman, Bishop & Ebel                Mean item  Mean
Study              Subjects           Test-Wiseness Skill                   p-value    difference

Dunn & Goldstein   832 Army           Specific determiner (II.B.3)          .53        .03
(1959)             trainees, about    Length of correct alt. (II.B.1.a)     .59        .07
                   200 per test       Grammar cue (II.B.1.h)                .55        .03
                                      Length and grammar cues               .61        .09

Board & Whitney    160 under-         Length of correct alt. (II.B.1.a)     .68        .01
(1972)             graduates          Grammar cue (II.B.1.h)                .67        .00

Weiten (1984)      54 under-          Stem-answer resemblance (II.B.4)      .59        .10
                   graduates          Grammar cue (II.B.1.h)                .60        .03
                                      Absurd alternatives (I.D.1)           .71        .18
                                      Length of correct alt. (II.B.1.a)     .60        .09

Table 2 

Findings on Difficulty of Test-Wiseness Skills from Studies Using Children 

                                      Millman, Bishop & Ebel                Mean item  Mean
Study              Subjects           Test-Wiseness Skill                   p-value    difference

Slakter et al.     2361 students      Stem-answer resemblance (II.B.4)      2 (a)
(1970a)            grades 5-11        Incorrect/absurd options (I.D.1)      1
                                      Similar options (I.D.2)               3
                                      Specific determiners (II.B.3)         4

Slakter et al.     76 high school     Stem-answer resemblance (II.B.4)      .82        .15 (b)
(1970b)            seniors given      Incorrect/absurd options (I.D.1)      .86        .02
                   testwiseness       Similar options (I.D.2)               .80        .28
                   training           Specific determiners (II.B.3)         .75        .33

Diamond & Evans    95 suburban        Length of correct alt. (II.B.1.a)     .53
(1972)             6th graders        Grammar cue (II.B.1.h)                .35
                                      Specific determiners (II.B.3)         .50
                                      Alliterative association (II.B.4)     .77
                                      Overlapping distractors (I.D.2)       .45

Diamond et al.     76 inner-city      Length of correct alt. (II.B.1.a)     .39
(1976)             5th and 6th        Grammar cue (II.B.1.h)                .23
                   graders            Specific determiners (II.B.3)         .31
                                      Alliterative association (II.B.4)     .45
                                      Overlapping distractors (I.D.2)       .34

McMorris et al.    494 suburban       Stem-answer resemblance (II.B.4)                 .09
(1972)             11th graders       Grammar cue (II.B.1.h)                           .07
                                      Length of correct alt. (II.B.1.a)                .03

Carter (1986)      312 7th            Choice "C" keyed alt. (II.B.1.d)      .69
                   graders            Length of correct alt. (II.B.1.a)     .50
                                      Alliterative association (II.B.4)     .55
                                      Grammar cue (II.B.1.h)                .27
                                      +/- options (II.B.1.f)                .80

Morse (1980)       2860 5th and       Guess (I.C.1)                         -2.15 (c)
                   6th graders        Look over test (I.A.2/I.A.3)          -1.20
                                      Read & follow directions (I.B.1)      -.80
                                      Cues elsewhere in test (I.D.5)        -.23
                                      Change answer if wrong (I.A.5)        -.02
                                      Known wrong answers (I.D.1)           -.02
                                      Specific determiners (II.B.3)         .04
                                      Stem-answer resemblance (II.B.4)      .10
                                      Grammar cue (II.B.1.h)                .14
                                      Length of correct alt. (II.B.1.a)     .22
                                      Budget time & check progress (I.A.2)  .74
                                      Similar alternatives (I.D.2/I.D.4)    .97

Note: (a) Exact values were not given; these represent rank order from easiest to hardest. 
(b) These represent differences between the trained students and 76 untrained students. 
(c) These represent Rasch difficulty values, expressed as logits. 
The mean differences shown for Slakter et al. (1970b) and McMorris et al. (1972) are those 
stated in the text. 

Table 3 

Summary Statistics for Rasch Difficulty Estimates by Test-Wiseness Skill 

Skill                            Mean      SD

Grammar cue                     -0.33     0.39
Longer correct alt.             -0.28     0.43
Absurd alternatives             -0.27     0.70
Cues elsewhere on test          -0.01     0.61
More precise/qualified alt.      0.03     0.32
Alliterative association         0.25     0.36
Specific determiners             0.58     0.52

Overall                          0.00     0.56

Note: 10 items per skill. 

Table 4 

ANOVA Summary Table 

Source          SS      df     MS       F      prob(F)

TW Skill        6.49     6    1.08     4.47    .0008
Residual       15.24    63    0.24
Total          21.73    69

Note: Eta squared = .30 